[WIP] a proposal to document all datasets and models #163

campoy · 2018-03-09T21:27:28Z

Signed-off-by: Francesc Campoy <[email protected]>

campoy · 2018-03-09T21:30:31Z

Initial proposal, pretty vague for now but establishing what I think should be done as a first step.

Please provide feedback on the content, not necessarily the form or typos (we'll fix those later on)

Especially interested in @eiso's and @vmarkovtsev's opinions, but feel free to provide yours too!

vmarkovtsev · 2018-03-14T11:18:31Z

This is to notify that I am here and struggling to find the time for a proper review. ETA is Friday.

campoy · 2018-03-20T18:54:57Z

Any news, @vmarkovtsev ?

vmarkovtsev · 2018-03-20T19:56:51Z

Damn I am squeezed but will do my best to review ASAP, sorry

eiso · 2018-03-21T11:58:40Z

I read through the proposal carefully and @campoy and I also discussed it in person. I am a big fan of this approach to information design. It makes it a lot clearer.

One minor comment is on the name "predictors". E.g., given input code, give me all tests that are likely to be related to it. To me this is not prediction, it's inference.

campoy · 2018-03-21T18:51:48Z

Good point, I replaced predictor by inferencer which is probably more accurate.
Any other reviews are welcome. Maybe it's time to discuss this in a meeting?

vmarkovtsev

I am rejecting this proposal. It does not fit into my vision of organizing ML at source{d}. We apparently need to run this through our project workflow: design document, design meetings, kick-off meeting and finally implementation.

vmarkovtsev · 2018-03-22T09:34:39Z

developer-community/datasets-and-models.md

+
+As we start publishing more datasets and models, it is important to keep in mind why we're doing this.
+
+> We publish datasets because we want to contribute back to the Open Source and Machine Learning communities.


It's not only about contributing back, as happens with code. We also want to increase the popularity of MLonCode and attract more people. A dataset is always the starting point of any DS research. No data => no research.

vmarkovtsev · 2018-03-22T09:44:05Z

developer-community/datasets-and-models.md

+We consider datasets and models to be good when they are:
+- discoverable,
+- reproducible, and
+- reusable.


What is meant by "reusable"? It is not a typical concept. I would say that a good dataset is:

1. Discoverable.
2. Documented in depth (this is different from (1)). So many times I have seen cool data very poorly described. It would also be nice to see problem suggestions for newbies.
3. Accessible. Data should be indexed if it makes sense, and the format matters too. As I heard from Konstantin the other day, "just give me the darn CSVs, I am tired of fighting with Siva". If the dataset is big and targets Spark, it should not be a single gzipped txt; it should be splittable lzo chunks plus a tool for reading those files without Spark. Another example: data distributed through BTSync. Nobody will serve it; it is not another hot pirated movie or a cracked AAA game.
4. Reproducible. Few people really want it, but still.
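To make the accessibility point concrete, here is a minimal sketch of "many small chunks plus a reader that does not need Spark". It is only an illustration: it uses gzip instead of lzo because the Python standard library has no lzo codec, and the `part-NNNNN.csv.gz` naming and both function names are assumptions, not an existing source{d} layout.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, Iterator, List


def write_chunks(rows: List[Dict[str, str]], directory: Path, chunk_size: int) -> List[Path]:
    """Split a non-empty list of rows into small gzipped CSV files (hypothetical layout)."""
    paths = []
    for start in range(0, len(rows), chunk_size):
        path = directory / f"part-{start // chunk_size:05d}.csv.gz"
        with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()  # every chunk is readable on its own
            writer.writerows(rows[start:start + chunk_size])
        paths.append(path)
    return paths


def read_chunks(directory: Path) -> Iterator[Dict[str, str]]:
    """Stream rows chunk by chunk; only one file is open at a time."""
    for path in sorted(directory.glob("part-*.csv.gz")):
        with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)
```

Because each chunk carries its own header, a consumer can process any subset of the files independently, which is the property the Spark-oriented layout is after.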

vmarkovtsev · 2018-03-22T09:48:52Z

developer-community/datasets-and-models.md

+
+It seems to be quite established that the relationship between datasets, models, and other concepts is somehow expressed in the following graph.
+
+![dataset graph](graph.png)


"kernel" is not a typical word. It may be used in NN contexts, but in general, it is confusing. I would replace it with a "training algorithm". Usually, it trains the model, not generates it. The same thing with "inferencer", I would replace it with "application".

Awesome chart, I like it a lot!

vmarkovtsev · 2018-03-22T09:51:53Z

developer-community/datasets-and-models.md

+making predictions. Problems are the starting motivation
+and ending point of most of our Machine Learning processes.
+
+Problems have a clear objective, and a measure of success that


This Oxford comma really confuses me. I was about to propose changing "let" to "lets" but then realized it was about both things.

vmarkovtsev · 2018-03-22T10:09:02Z

developer-community/datasets-and-models.md

+
+This is normally documented in research papers, with references
+to what datasets and kernels were used, as well as how much
+training time it took to obtain the resulting model.


So this is the thing: there is a huge difference between a research paper and what we want to achieve. Papers are always limited in size and the authors desperately try to squeeze as much information as possible. This often leads to excluding important descriptions, which are not strictly necessary but simplify the reproduction.

Think of it as a physical experiment. A paper includes: initial conditions; methodology; observations throughout the experiment lifetime; the empirical results; explanations and conclusions. Unfortunately, ML papers do not.

So a dream model documentation should have:

- Thorough dataset documentation.
- Problem statement. This includes choosing the right quality metric.
- Architecture description and metaparameter values. Random seed: this guy is so hard to get right and requires using exactly the same code.
- All possible plots of how the model was trained. Time observations.
- Stability and deviations: how the metaparameters influence the result and how stable the training is. E.g. different random seeds may lead to quite different results if the model is buggy. Another example: we reduce the number of neurons 100x and get a 1% accuracy drop, which is a fair tradeoff for many people.
- Results: achieved metrics, examples.
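The "stability and deviations" item can be sketched as a small report over several random seeds. This is an illustration only: `train_and_score` is a hypothetical stand-in for a real training run, and the metric values it returns are synthetic.

```python
import random
import statistics
from typing import Callable, Dict, Iterable


def train_and_score(seed: int) -> float:
    """Hypothetical stand-in for training a model and returning its quality metric."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    return 0.90 + rng.uniform(-0.01, 0.01)


def seed_stability(train: Callable[[int], float], seeds: Iterable[int]) -> Dict[str, float]:
    """Run the same training with several seeds (at least two) and summarize the spread."""
    scores = [train(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


report = seed_stability(train_and_score, range(5))
```

A large `stdev` relative to the differences between architectures would be a hint that the reported metric depends more on the seed than on the model.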

vmarkovtsev · 2018-03-22T10:26:47Z

developer-community/datasets-and-models.md

+- format of the dataset
+- what other datasets (and versions) were used to generate this?
+- what models have been trained with this dataset
+- LICENSE (the tools and scripts are licensed, but not the datasets?)


Some of the issues are to be resolved by attaching the paper.

Yes, but we can't assume having a paper for everything. It won't be feasible from a time perspective.

Absolutely. I meant solely PGA here.

vmarkovtsev · 2018-03-22T10:37:33Z

developer-community/datasets-and-models.md

+```
+
+What are we missing?
+- Versioned models, corresponding to versioned datasets.


There is versioning, though not reflected in src-d/models. Models can derive, either with the relation to parent or not. E.g. it is a common situation when our data engineering and filtering are not perfect and we miss data or pass in garbage. In that case, the relation to the previous model is saved. Sometimes it is just a regular update without the relation to the previous one.

I think the parent relation makes @campoy's analogy to containers even stronger here. There is a lot we can/should learn from how the Docker registry tackled this.
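A rough sketch of the parent relation described above, assuming each model carries a UUID and an optional link to the model it was derived from. `ModelRecord`, its fields, and `lineage` are hypothetical names for illustration, not the actual src-d/models format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelRecord:
    name: str
    version: str
    parent: Optional[str] = None  # UUID of the model this one derives from, if any
    model_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def lineage(registry: Dict[str, ModelRecord], model_id: str) -> List[ModelRecord]:
    """Walk parent links from a model back to its oldest recorded ancestor."""
    chain: List[ModelRecord] = []
    seen = set()
    current: Optional[str] = model_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent links at {current}")
        seen.add(current)
        record = registry[current]
        chain.append(record)
        current = record.parent
    return chain
```

A regular update without a relation to the previous model is simply a record whose `parent` is `None`, so its lineage has length one.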

vmarkovtsev · 2018-03-22T10:41:15Z

developer-community/datasets-and-models.md

+
+What are we missing?
+- Versioned models, corresponding to versioned datasets.
+- Reference to the code (kernel) that was used to generate the model.


Another important note: models contain dependencies on the upstream models which were used in the generation process. Datasets are also models in this terminology and should have a UUID (yeah, this is confusing, I know).

The only way which I see to reference the code is to record the whole Python package dependency tree. This still misses the actual custom calling code in many cases, and I need to apply some dark Python alchemy to discover and record it. We also need to store it somewhere in the model file.

Since each model references the code it was created from, the dependency tree is there already.
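As a partial illustration of recording the environment, the sketch below snapshots the installed Python packages with the standard `importlib.metadata` module. It captures a flat list of installed versions only, not the dependency tree and not the custom calling code mentioned above; the function names and the output layout are assumptions.

```python
import json
import sys
from importlib import metadata
from typing import Dict


def snapshot_environment() -> Dict[str, object]:
    """Record the interpreter version and every installed distribution's version."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            packages[name] = version
    return {
        "python": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


def snapshot_json() -> str:
    """Serialize the snapshot so it could be stored next to (or inside) a model file."""
    return json.dumps(snapshot_environment(), indent=2)
```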

vmarkovtsev · 2018-03-22T10:43:24Z

developer-community/datasets-and-models.md

+
+Since we care about individual versioning of datasets and models,
+it seems like it's an obvious choice to use a git repository per dataset,
+and model.


We will die. Seriously. I tried it and it is completely unmaintainable. A special pain is adding new models and being blocked for a few days until the repository is created. I am strongly against this idea.

There is also the central registry of our models in src-d/models which is of strong necessity as the only way to fetch the index and automatically download models in downstream apps.

We already have 5 models to date, and the only reason why there are so few is that we did not have data. Now that we have PGA, we will bake new models like pies, with tens of different architectures and problems. Models are not code repositories; there is no point in contributing to existing ones, it is always about adding something new.

Besides, we need to solve the problem with the community, because we want to allow external people to push models into our registry. Think of it as DockerHub for models.

> Think of it as DockerHub for models.

What I believe @campoy is proposing is DockerHub for models, datasets, training algorithms, applications etc.

Model is an artifact of a training algorithm. Algorithm is code and we can improve it, fix it, etc. So the algorithms should be on GitHub/Git, separate from the model storage.

vmarkovtsev · 2018-03-22T10:50:04Z

developer-community/datasets-and-models.md

+Since we imagine these tools extracting information from the repositories
+automatically, it is important to keep formatting in mind.
+
+I'm currently considering whether a `toml` file should be defined containing


This will not work. All the metadata should be generated automatically from the self-contained ASDF files. Otherwise this will be a nightmare to support.

eiso · 2018-03-23T11:51:44Z

@vmarkovtsev based on your really nice/insightful feedback, I feel that you're not rejecting the proposal but want to amend the implementation.

I feel that @campoy's main point here is, how we build/present/communicate a mental model of how you build on top of the source{d} stack.

I think at this point it makes sense to have a small meeting about this. I would also like to invite our new Head of Architecture to review this proposal @smola

vmarkovtsev · 2018-03-23T15:49:03Z

Yes, this better explains my rationale, thanks Eiso. I really love this graph BTW, and additional ❤️ for using Graphviz to plot it.

campoy · 2018-04-04T18:53:37Z

So what should the next step be?
Should I drop this PR and follow the engineering workflow? I'm totally fine with that

vmarkovtsev · 2018-04-04T19:01:45Z

I would go with a DD (template). It is easier to discuss and fight, also this change would require actions from ML team or even Apps, depending on our depth level.

eiso · 2018-04-06T18:20:40Z

@vmarkovtsev @campoy agree with next steps and looping in @marnovo here.

campoy · 2018-04-09T21:37:06Z

FYI, I'm working on going through with this and creating a Design Document for my ideas on the topic.
I'll update this issue once I have a draft.

campoy · 2018-04-10T01:51:36Z

I wrote an initial Design Document.
For now it's very empty intentionally.

https://docs.google.com/document/d/1EbwfOd4UpVXCprW-9ApPhX-HN6PXHODVbKj4ajJtDfM/edit?usp=sharing

What do you all think?

smola

No major comments from my side. It looks good to me.

smola · 2018-04-10T09:04:00Z

developer-community/datasets-and-models.md

+
+- short description
+- long description and links to papers and blog posts
+- technical sheet


It might seem obvious, but we should also include:

date the dataset was generated

date of retrieved data if applicable (this might be included in the link to the original data sources)

smola · 2018-04-10T09:05:01Z

developer-community/datasets-and-models.md

+
+### Versioning and Releases
+
+Every time a new version of a dataset or model is released a new tag and


Do we want to standardize the dataset version on something? Maybe just the date, or semver?

semver sounds pretty good, but it's a solution so I don't wanna decide on it just yet
I was considering using Docker hub, which would bring this for free.
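For reference, if semver were chosen, picking the newest release from a list of tags could look like the sketch below. It handles only the `MAJOR.MINOR.PATCH` core with an optional `v` prefix (no pre-release or build metadata), and the tag format is an assumption since nothing has been decided here.

```python
import re
from typing import Iterable, Tuple

_SEMVER_CORE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Turn 'v1.10.0' or '1.10.0' into (1, 10, 0); reject anything else."""
    match = _SEMVER_CORE.match(tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def latest(tags: Iterable[str]) -> str:
    """Return the tag with the highest version, comparing numerically."""
    return max(tags, key=parse_version)
```

Comparing parsed tuples rather than raw strings matters because `"v1.10.0"` sorts before `"v1.9.3"` lexicographically.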

smola · 2018-04-26T07:21:02Z

@campoy
@dennwc has proposed JSON-LD and the Dataset schema to annotate datasets:
src-d/datasets#51

More info: https://developers.google.com/search/docs/data-types/dataset

It seems like a nice solution compared to creating our own format and schema.
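A minimal sketch of what such an annotation could look like, built with schema.org `Dataset` properties (`name`, `description`, `url`, `version`, `dateCreated`, `license`, `distribution`). The helper name and every value passed to it are placeholders, not real source{d} metadata.

```python
import json
from typing import Dict


def dataset_jsonld(name: str, description: str, url: str, version: str,
                   date_created: str, license_url: str,
                   download_url: str, encoding_format: str) -> Dict[str, object]:
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "version": version,
        "dateCreated": date_created,  # date the dataset was generated
        "license": license_url,
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": encoding_format,
            "contentUrl": download_url,
        }],
    }


# Placeholder values for illustration only.
doc = dataset_jsonld(
    name="Example dataset",
    description="A placeholder description of an example dataset.",
    url="https://example.org/datasets/example",
    version="1.0.0",
    date_created="2018-04-26",
    license_url="https://example.org/license",
    download_url="https://example.org/datasets/example/part-00000.csv.gz",
    encoding_format="text/csv",
)
print(json.dumps(doc, indent=2))
```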

vmarkovtsev · 2018-11-05T13:06:58Z

I am back to this, finally. Reading and editing the document.

vmarkovtsev · 2018-11-21T10:59:47Z

I am frozen, there are some urgent blocking tasks in style-analyzer

vmarkovtsev · 2018-11-22T18:24:55Z

I am back to life. Actually, I have already written most of my thoughts and the document is ready for @campoy's and @marnovo's review.

campoy · 2018-11-26T21:28:30Z

Commented on the doc

[WIP] a proposal to document all datasets and models

8675ab4

Signed-off-by: Francesc Campoy <[email protected]>

campoy force-pushed the docs branch from 40192de to 8675ab4 Compare March 9, 2018 21:28

campoy requested review from eiso and vmarkovtsev March 9, 2018 21:29

eiso requested a review from marnovo March 21, 2018 11:58

predictor -> inferencer

70da984

vmarkovtsev requested changes Mar 22, 2018

eiso requested a review from smola March 23, 2018 11:51

vmarkovtsev mentioned this pull request Apr 3, 2018

Enable sharing and publishing MLonCode models src-d/okrs#44

Closed

2 tasks

smola reviewed Apr 10, 2018

smola mentioned this pull request Apr 26, 2018

Datasets should be annotated with JSON-LD src-d/datasets#51

Open

marnovo mentioned this pull request Jul 8, 2018

Improve model readme md templates src-d/models#2

Open


